Engee documentation
Notebook

Regression task for predicting medical costs

Introduction

In this example, we will perform an exploratory data analysis, including an initial inspection of the dataset structure, a study of the distributions of the features and the target variable, a check for missing values, outliers and duplicates, and an analysis of correlations between variables. Next, we will build and train regression models to predict the target variable and evaluate their quality with appropriate metrics. Finally, we will analyze feature importance to determine which features contribute the most to the model's predictions and can be considered the most significant for the task at hand.

Installing the libraries

You need to manually specify the path to the project in order to change into the working directory, and then install the required dependencies.
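A minimal sketch of switching to the working directory (the path below is taken from the requirements.txt location used in this demo and is only an assumption; replace it with your actual project path):

In [ ]:
import os

# Hypothetical working directory for this demo; adjust to your setup
project_dir = "/user/Demo_public/biomedical/predictionmeddata"
os.chdir(project_dir)
print(os.getcwd())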

Before starting work, we will import all the necessary libraries.

In [ ]:
 !pip install -r /user/Demo_public/biomedical/predictionmeddata/requirements.txt
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import randint, uniform, loguniform
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor

Let's read the data and take a look at it.

In [ ]:
df = pd.read_csv("insurance.csv")
df.head(10)
Out[0]:
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
5 31 female 25.740 0 no southeast 3756.62160
6 46 female 33.440 1 no southeast 8240.58960
7 37 female 27.740 3 no northwest 7281.50560
8 37 male 29.830 2 no northeast 6406.41070
9 60 female 25.840 0 no northwest 28923.13692

Let's analyze the table:

  1. age — age
  2. sex — gender
  3. BMI — body mass index
  4. children — the number of children
  5. smoker — smokes or not
  6. region — region of residence
  7. charges — medical expenses

Based on this table, features 1 through 6 are the inputs to the regression model, and feature 7 (charges) is the target to be predicted from features 1-6.

Exploratory data analysis (EDA)

Let's check the table for missing values.

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

There are no missing values. The table contains 3 categorical features that will later be encoded for model training; the remaining features are numeric.

In [ ]:
df.describe()
Out[0]:
age bmi children charges
count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.663397 1.094918 13270.422265
std 14.049960 6.098187 1.205493 12110.011237
min 18.000000 15.960000 0.000000 1121.873900
25% 27.000000 26.296250 0.000000 4740.287150
50% 39.000000 30.400000 1.000000 9382.033000
75% 51.000000 34.693750 2.000000 16639.912515
max 64.000000 53.130000 5.000000 63770.428010

The statistics show that the average age is 39 with a standard deviation of 14, so most clients are roughly between 25 and 53 years old. The maximum BMI of 53 far exceeds the obesity threshold and may be an outlier. Half of the clients have a BMI above 30, i.e. most are at least mildly obese, and a quarter have a BMI above about 35. The maximum charge of about 63k conventional units (CU) lies well outside the mean ± 3σ range, so this observation may also be an outlier (a quick check is sketched below).
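To back up the outlier guess, here is a minimal sketch that counts how many observations fall outside the mean ± 3σ interval for charges (an illustration, not part of the original pipeline):

In [ ]:
# Rough outlier check: observations with charges outside mean ± 3*std
mean_charges = df["charges"].mean()
std_charges = df["charges"].std()
lower, upper = mean_charges - 3 * std_charges, mean_charges + 3 * std_charges
n_outside = ((df["charges"] < lower) | (df["charges"] > upper)).sum()
print(f"Bounds: [{lower:.0f}, {upper:.0f}], observations outside: {n_outside}")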

Classical machine learning models cannot work directly with categorical features, i.e. features that are not numbers, so these features need to be encoded as numbers. We use the One Hot Encoding technique.

In [ ]:
df_encode = pd.get_dummies(df, columns=["sex", "region", "smoker"])
df_encode = df_encode.drop('sex_female', axis=1)
df_encode = df_encode.drop('smoker_no', axis=1)
df_encode.head()
Out[0]:
age bmi children charges sex_male region_northeast region_northwest region_southeast region_southwest smoker_yes
0 19 27.900 0 16884.92400 False False False False True True
1 18 33.770 1 1725.55230 True False False True False False
2 28 33.000 3 4449.46200 True False False True False False
3 33 22.705 0 21984.47061 True False True False False False
4 32 28.880 0 3866.85520 True False True False False False

First, let's study the data distributions by building histograms of the features.

In [ ]:
colors = ['blue', 'green', 'red', 'purple', 'orange']
fig, axes = plt.subplots(nrows=4, ncols=1,figsize=(7, 7))
axes = axes.flatten()
axes[0].hist(df["age"], bins=10, color=colors[0])
axes[0].set_title('Age')
axes[0].set_xlabel("Age of people")
axes[0].set_ylabel("Number of people")

axes[1].hist(df["bmi"], bins=10, color=colors[0])
axes[1].set_title('bmi')
axes[1].set_xlabel("bmi")
axes[1].set_ylabel("Number of people")

axes[2].hist(df["children"], bins=6, color=colors[0])
axes[2].set_title('Children')
axes[2].set_xlabel("Number of children")
axes[2].set_ylabel("Number of people")

axes[3].hist(df["charges"], bins=10, color=colors[0])
axes[3].set_title('Expenses')
axes[3].set_xlabel("People's expenses")
axes[3].set_ylabel("Number of people")
plt.subplots_adjust(hspace=0.6)
plt.tight_layout()

By age, the sample is fairly evenly distributed: it includes both young clients (from 18 years old) and elderly ones (up to 64 years old), with no noticeable skew.

The BMI distribution looks roughly normal, a bell shifted somewhat to the right. Most people fall in the range of 25 to 35, which corresponds to overweight and obesity, and there are clients with very high BMI (>40).

In terms of the number of children, the distribution is sharply skewed: most clients do not have children.

The charges histogram shows that most people's expenses are below 30k units.

Next, let's analyze how the features relate to each other and to the target by looking at their correlations. We will use the Spearman correlation, since it does not require the distributions to be normal.
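As a quick illustration of why the choice matters, here is a small sketch comparing Pearson and Spearman correlations of the encoded features with the target (a side check, not part of the original pipeline):

In [ ]:
# Pearson vs Spearman correlation of each feature with charges
comparison = pd.DataFrame({
    "pearson": df_encode.corr(method="pearson")["charges"],
    "spearman": df_encode.corr(method="spearman")["charges"],
}).sort_values("spearman", ascending=False)
print(comparison)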

In [ ]:
spearman_corr = df_encode.corr(method='spearman')
spearman_corr['charges'].sort_values(ascending=False)
Out[0]:
charges             1.000000
smoker_yes          0.663460
age                 0.534392
children            0.133339
bmi                 0.119396
region_northeast    0.046109
region_southeast    0.017275
sex_male            0.009490
region_northwest   -0.021634
region_southwest   -0.042354
Name: charges, dtype: float64

Based on the table above, smoking and age have the strongest correlation with charges: smokers tend to have noticeably higher medical expenses, and expenses also grow with age. The number of children has little effect on charges (most of the observed clients have no children, so its correlation with charges is small), and the same holds for the region of residence, gender and bmi.

Let's see how categorical features affect spending.

In [ ]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.boxplot(y="region", x="charges", data=df, ax=axes[0])
axes[0].set_title("Spending by region")

sns.boxplot(y="sex", x="charges", data=df, ax=axes[1])
axes[1].set_title("Spending by gender")

sns.boxplot(y="smoker", x="charges", data=df, ax=axes[2])
axes[2].set_title("Smoking expenses")

plt.tight_layout()
plt.show()

By region there are practically no differences in charges: the regions have similar distributions, and outliers are present everywhere.

There are also no noticeable differences by gender.

The situation with smoking is the opposite: for non-smokers the typical charges are below 10 thousand, while for smokers they are around 35 thousand. The spread among smokers is also much wider, which indicates a large number of cases with extremely high costs.
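These figures are easy to verify with a simple groupby (a sketch; the exact numbers are whatever the data gives):

In [ ]:
# Median charges for smokers vs non-smokers
print(df.groupby("smoker")["charges"].median())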

In [ ]:
plt.figure(figsize=(8,5))
sns.scatterplot(x="age", y="charges", hue="smoker", data=df, alpha=0.6)
plt.title("Dependence of expenses on age, taking into account smoking")
plt.xlabel("Age")
plt.ylabel("Expenses ")
plt.show()

The charges grow roughly linearly with age, i.e. older clients are more expensive to insure. It is also clear that the smoking factor shifts the whole charges-vs-age dependence upward by a roughly constant amount (a rough estimate of this shift is sketched below).
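A rough way to quantify this shift is to fit a separate straight line charges ≈ slope * age + intercept for smokers and non-smokers (a minimal sketch, not part of the original pipeline):

In [ ]:
# Fit charges ~ age separately for each smoking group to estimate the vertical offset
for group, subset in df.groupby("smoker"):
    slope, intercept = np.polyfit(subset["age"], subset["charges"], deg=1)
    print(f"smoker={group}: charges ≈ {slope:.0f} * age + {intercept:.0f}")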

In [ ]:
df['bmi_group'] = pd.cut(df['bmi'], bins=[0, 25, 30, 100],
                         labels=['Standard', 'Overweight', 'Obesity'])
In [ ]:
g = sns.FacetGrid(df, col="smoker", hue="bmi_group", height=5)
g.map(sns.scatterplot, "age", "charges", alpha=0.6).add_legend()
plt.show()

The following conclusions can be drawn from the graphs above:

For non-smokers, expenses increase with age, but generally remain within moderate limits. Even with obesity, the costs usually do not exceed 20-25 thousand. This suggests that without the smoking factor, the effect of weight is not so pronounced.

Among smokers, two distinct bands are clearly visible: one group, people with normal weight or overweight, spends on average about 20 thousand, while the other group, obese smokers, is consistently higher at about 35-50 thousand.
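The two bands can also be seen numerically with a groupby over smoking status and the bmi_group column created above (a sketch):

In [ ]:
# Median charges by smoking status and BMI group
print(df.groupby(["smoker", "bmi_group"], observed=True)["charges"].median())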

Let's check whether the number of children affects people's expenses.

In [ ]:
plt.figure(figsize=(8,5))
sns.barplot(x="children", y="charges", data=df, estimator=lambda x: x.mean())
plt.title("Average expenses by number of children")
plt.xlabel("Number of children")
plt.ylabel("Average expenses")
plt.show()

The number of children has virtually no effect on medical expenses; only families with 2-3 children have slightly higher-than-average charges.

Let's look at the overall picture of feature correlations by drawing a heat map, shown below.

In [ ]:
sns.heatmap(df_encode.corr(method='spearman'), cmap="coolwarm", annot=False)
Out[0]:
<Axes: >

Looking at the feature correlations, you can also see a slight relationship between bmi and the region of residence, in particular the Southwest region: this pair shows a somewhat more pronounced monotonic relationship.
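That particular observation can be checked directly from the already computed correlation matrix (a minimal sketch):

In [ ]:
# Spearman correlation of bmi with each region dummy
print(spearman_corr.loc["bmi", ["region_northeast", "region_northwest",
                                "region_southeast", "region_southwest"]])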

Model training

The list of models that we will evaluate during training

  1. Linear Regression
  2. Random Forest
  3. CatBoost Regressor

Linear Regression (Lasso)

Let's create a table where we will record the results of model training.

In [ ]:
Result_model = pd.DataFrame(columns=["Model", "MAE", "RMSE", "R2"])

For linear models, it is important to work correctly with categorical variables, which is why we encoded them using One Hot Encoding.

In the code below, y is our target variable (charges) and X contains the features the model will be trained on. The dataset is split into a training set and a test set so that, after training, we can evaluate the model's ability to generalize to data it has not seen before.

Metrics that we will use to evaluate the trained model:

  1. MAE is the mean absolute error; it shows how far off the model is on average.
  2. RMSE is the root-mean-square error; large errors are penalized more heavily.
  3. R2 is the coefficient of determination; it shows how well the model explains the variance in the data (the formulas are sketched right after this list).
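For reference, a minimal sketch of these formulas written out with numpy (equivalent to the sklearn functions used below; the helper names are illustrative):

In [ ]:
# The three metrics written out explicitly
def mae_manual(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse_manual(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2_manual(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot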
In [ ]:
X = df_encode.drop("charges", axis=1)
y = df_encode["charges"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The first model is Lasso. The code below trains it, computes the metrics, and prints the feature coefficients; features whose coefficients are shrunk to zero are the ones the model considers less informative. However, even such features can still make a meaningful contribution.

In [ ]:
lasso = Lasso(alpha=0.5)

lasso.fit(X_train, y_train)

y_pred_lasso = lasso.predict(X_test)

mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
r2_lasso = r2_score(y_test, y_pred_lasso)

print("Lasso Regression")
print("MAE:", mae_lasso)
print("RMSE:", rmse_lasso)
print("R²:", r2_lasso)

coef_lasso = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": lasso.coef_
}).sort_values(by="Coefficient", ascending=False)

print("Lasso Coefficients:")
print(coef_lasso)
Lasso Regression
MAE: 3787.6327653920976
RMSE: 5729.913847940855
R²: 0.7797405299032881
Lasso Coefficients:
            Feature   Coefficient
8        smoker_yes  23345.644279
4  region_northeast    939.663756
5  region_northwest    579.789302
2          children    573.769881
1               bmi    369.569828
0               age    257.868504
7  region_southwest     -0.000000
3          sex_male   -167.733537
6  region_southeast   -478.089891

Adding metric values to the table

In [ ]:
row = pd.DataFrame({
    "Model": ["Lasso"],
    "MAE": [mae_lasso],
    "RMSE": [rmse_lasso],
    "R2": [r2_lasso]
})

Result_model = pd.concat([Result_model, row])
/tmp/ipython-input-3838164166.py:8: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  Result_model = pd.concat([Result_model, row])

Random Forest

The next model is a random forest. The approach is the same: train, test, and look at the feature importances.

In [ ]:
rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=4,
    min_samples_split = 20,
    min_samples_leaf = 20
)

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest")
print("MAE:", mae_rf)
print("RMSE:", rmse_rf)
print("R²:", r2_rf)

importances = pd.DataFrame({
    "Feature": X.columns,
    "Importance": rf.feature_importances_
}).sort_values(by="Importance", ascending=False)

print("\The importance of signs:")
print(importances)
Random Forest
MAE: 2393.117022714948
RMSE: 3944.832628669223
R²: 0.8956011850181776

Feature importances:
            Feature  Importance
8        smoker_yes    0.700647
1               bmi    0.175183
0               age    0.115571
2          children    0.007219
4  region_northeast    0.000805
3          sex_male    0.000345
7  region_southwest    0.000077
6  region_southeast    0.000077
5  region_northwest    0.000076

Adding metrics to the table

In [ ]:
row = pd.DataFrame({
    "Model": ["Random Forest"],
    "MAE": [mae_rf],
    "RMSE": [rmse_rf],
    "R2": [r2_rf]
})

Result_model = pd.concat([Result_model, row])

It can be seen that the metrics of the random forest model are much better than those of the linear model.

CatBoost Regressor

The next model is the gradient boosting model from the CatBoost package. Let's go through the same pipeline as with the models above.

In [ ]:
cat_features = ["smoker", "sex", "region"]
feature_cols = [c for c in df.columns if c not in ["charges"]]


cat = CatBoostRegressor(
    iterations=1000,
    depth=6,
    learning_rate=0.01,
    verbose=100, random_state=42
)

X = df[feature_cols]
y = df["charges"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

cat.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test), verbose=100)

y_pred_cat = cat.predict(X_test)

mae_cat1 = mean_absolute_error(y_test, y_pred_cat)
rmse_cat1 = np.sqrt(mean_squared_error(y_test, y_pred_cat))
r2_cat1 = r2_score(y_test, y_pred_cat)

feature_importance = cat.get_feature_importance(prettified=True)
print(feature_importance)

print("CatBoost")
print("MAE:", mae_cat1)
print("RMSE:", rmse_cat1)
print("R²:", r2_cat1)
0:	learn: 12007.5544064	test: 12015.2822361	best: 12015.2822361 (0)	total: 1.01ms	remaining: 1.01s
100:	learn: 6642.7346665	test: 6462.9453476	best: 6462.9453476 (100)	total: 75.9ms	remaining: 675ms
200:	learn: 5028.4127086	test: 4811.8940123	best: 4811.8940123 (200)	total: 168ms	remaining: 670ms
300:	learn: 4569.0406421	test: 4403.8806511	best: 4403.8806511 (300)	total: 250ms	remaining: 582ms
400:	learn: 4392.6205165	test: 4306.4059699	best: 4306.4059699 (400)	total: 334ms	remaining: 499ms
500:	learn: 4299.2805777	test: 4278.5911022	best: 4278.5911022 (500)	total: 414ms	remaining: 412ms
600:	learn: 4229.7798849	test: 4268.2470356	best: 4268.2470356 (600)	total: 496ms	remaining: 330ms
700:	learn: 4174.5635367	test: 4266.5310336	best: 4265.5489448 (698)	total: 574ms	remaining: 245ms
800:	learn: 4115.5054728	test: 4266.3003965	best: 4265.2015983 (774)	total: 655ms	remaining: 163ms
900:	learn: 4065.5331955	test: 4270.9462529	best: 4265.2015983 (774)	total: 748ms	remaining: 82.2ms
999:	learn: 4012.0603582	test: 4275.5718995	best: 4265.2015983 (774)	total: 833ms	remaining: 0us

bestTest = 4265.201598
bestIteration = 774

Shrink model to first 775 iterations.
  Feature Id  Importances
0     smoker    74.294819
1        bmi    13.216792
2        age     9.892903
3   children     1.253336
4     region     0.901687
5        sex     0.440463
CatBoost
MAE: 2465.2188274260034
RMSE: 4265.2015982088515
R²: 0.8759281804317242

Adding metrics to the table

In [ ]:
row = pd.DataFrame({
    "Model": ["CatBoostRegressor 1ver"],
    "MAE": [mae_cat1],
    "RMSE": [rmse_cat1],
    "R2": [r2_cat1]
})

Result_model = pd.concat([Result_model, row])

Next, we will try to select better hyperparameters for the gradient boosting model. All three metrics described above will be taken into account at once.

We will search for the hyperparameters with a random search over the specified parameters and their ranges.

In [ ]:
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

cat_base = CatBoostRegressor(
    verbose=False,
    random_state=42,
    task_type="GPU",
    devices="0"
)

param_distributions = {
    "iterations": randint(300, 1000),
    "depth": randint(4, 8),
    "learning_rate": loguniform(1e-3, 3e-1),
    "l2_leaf_reg": loguniform(1e-2, 1e2),
    "bagging_temperature": uniform(0.0, 1.0),
    "random_strength": loguniform(1e-3, 10),
    "grow_policy": ["SymmetricTree", "Depthwise"],
    "border_count": randint(32, 128),
    "leaf_estimation_iterations": randint(1, 3)
}

rs = RandomizedSearchCV(
    estimator=cat_base,
    param_distributions=param_distributions,
    n_iter=20,
    scoring=scoring,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=1,
    random_state=42,
    refit=False
)


rs.fit(X_train, y_train, cat_features=cat_features)

Next, we will pick the best result across the metrics. In our case, the configuration with the best coefficient of determination (R2) also showed good values for the other metrics.

In [ ]:
cv = rs.cv_results_
best_idx = np.argmax(cv["mean_test_r2"])
best_params = cv["params"][best_idx]
print("Best configuration by r2:", best_params)
Best configuration by r2: {'bagging_temperature': np.float64(0.3562978380769749), 'border_count': 93, 'depth': 4, 'grow_policy': 'SymmetricTree', 'iterations': 676, 'l2_leaf_reg': np.float64(3.8972698898745795), 'leaf_estimation_iterations': 1, 'learning_rate': np.float64(0.10624824455988377), 'random_strength': np.float64(2.7728241828010627)}

Using the obtained hyperparameters, we will train the model

In [ ]:
best_cat = CatBoostRegressor(
    verbose=False, random_state=42, task_type="GPU", devices="0", **best_params
)
best_cat.fit(X_train, y_train, cat_features=cat_features)

y_pred = best_cat.predict(X_test)

mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print("MAE:", mae)
print("RMSE:", rmse)
print("R²:", r2)

fi = best_cat.get_feature_importance(prettified=True)
print(fi)
MAE: 2349.3107371864035
RMSE: 4288.497881809796
R²: 0.8745691328646068
  Feature Id  Importances
0     smoker    72.454902
1        bmi    14.903237
2        age    10.728109
3   children     0.901835
4     region     0.835978
5        sex     0.175939

And we will also add the calculated metrics to the table.

In [ ]:
row = pd.DataFrame({
    "Model": ["CatBoostRegressor 2ver"],
    "MAE": [mae],
    "RMSE": [rmse],
    "R2": [r2]
})

Result_model = pd.concat([Result_model, row])

Let's do some feature engineering by creating new features from existing ones: we bucket age into decades, group clients by number of children, and split the bmi index into classes (i.e. into intervals).

In [ ]:
df = pd.read_csv("insurance.csv")
df["age_decade"] = (df["age"] // 10).astype(int).astype("category")
df["children_bucket"] = pd.cut(df["children"], [-1,0,2,99], labels=["0","1-2","3+"])
df["obesity_class"] = pd.cut(df["bmi"],
    [0,18.5,25,30,35,40,1e3], labels=["Under","Normal","Over","Ob1","Ob2", "Ob3"])

After that, we will conduct the same pipeline for training the basic model.

In [ ]:
cat_features = ["smoker", "sex", "region", "obesity_class", "age_decade", "children_bucket"]
feature_cols = [c for c in df.columns if c not in ["charges"]]


cat = CatBoostRegressor(
    iterations=2000,
    depth=6,
    learning_rate=0.01,
    verbose=100, random_state=42
)

X = df[feature_cols]
y = df["charges"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

cat.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test), verbose=100)

y_pred_cat = cat.predict(X_test)

mae_cat = mean_absolute_error(y_test, y_pred_cat)
rmse_cat = np.sqrt(mean_squared_error(y_test, y_pred_cat))
r2_cat = r2_score(y_test, y_pred_cat)

feature_importance = cat.get_feature_importance(prettified=True)
print(feature_importance)

print("CatBoost")
print("MAE:", mae_cat)
print("RMSE:", rmse_cat)
print("R²:", r2_cat)
0:	learn: 11933.3722839	test: 12382.0165070	best: 12382.0165070 (0)	total: 3.25ms	remaining: 6.5s
100:	learn: 6553.3062212	test: 6700.8384991	best: 6700.8384991 (100)	total: 199ms	remaining: 3.74s
200:	learn: 4995.8505835	test: 4931.1641999	best: 4931.1641999 (200)	total: 380ms	remaining: 3.4s
300:	learn: 4564.5766881	test: 4458.1505879	best: 4458.1505879 (300)	total: 558ms	remaining: 3.15s
400:	learn: 4412.3726510	test: 4335.9529290	best: 4335.9529290 (400)	total: 742ms	remaining: 2.96s
500:	learn: 4326.2994050	test: 4286.3989761	best: 4286.3989761 (500)	total: 925ms	remaining: 2.77s
600:	learn: 4250.9120302	test: 4263.9524398	best: 4263.9524398 (600)	total: 1.12s	remaining: 2.6s
700:	learn: 4183.8125620	test: 4250.3936707	best: 4250.3936707 (700)	total: 1.33s	remaining: 2.46s
800:	learn: 4112.8395328	test: 4241.2414234	best: 4241.2414234 (800)	total: 1.52s	remaining: 2.28s
900:	learn: 4045.4146292	test: 4234.0665711	best: 4233.8518279 (895)	total: 1.71s	remaining: 2.09s
1000:	learn: 3984.3708564	test: 4224.9261609	best: 4224.9201685 (999)	total: 1.91s	remaining: 1.9s
1100:	learn: 3930.0179700	test: 4219.6165314	best: 4219.5164715 (1098)	total: 2.11s	remaining: 1.72s
1200:	learn: 3874.3713973	test: 4211.8745119	best: 4211.6638182 (1196)	total: 2.33s	remaining: 1.55s
1300:	learn: 3827.2743082	test: 4206.4109026	best: 4206.2045928 (1295)	total: 2.55s	remaining: 1.37s
1400:	learn: 3780.7948401	test: 4204.2963017	best: 4204.2062397 (1397)	total: 2.75s	remaining: 1.18s
1500:	learn: 3736.6273160	test: 4200.9151489	best: 4200.2939566 (1472)	total: 2.95s	remaining: 981ms
1600:	learn: 3693.6442830	test: 4197.3577996	best: 4197.3577996 (1600)	total: 3.16s	remaining: 787ms
1700:	learn: 3656.7214768	test: 4196.6087530	best: 4196.2931068 (1667)	total: 3.39s	remaining: 595ms
1800:	learn: 3614.5572627	test: 4191.9198802	best: 4191.6668283 (1788)	total: 3.59s	remaining: 397ms
1900:	learn: 3577.5533875	test: 4192.1423002	best: 4190.3489371 (1880)	total: 3.79s	remaining: 198ms
1999:	learn: 3538.0569603	test: 4189.4415208	best: 4189.1639386 (1972)	total: 4s	remaining: 0us

bestTest = 4189.163939
bestIteration = 1972

Shrink model to first 1973 iterations.
        Feature Id  Importances
0           smoker    72.348387
1              bmi    12.427742
2              age     5.875907
3       age_decade     3.649860
4    obesity_class     1.693303
5           region     1.549990
6  children_bucket     1.130091
7         children     1.018538
8              sex     0.306182
CatBoost
MAE: 2282.9523492387857
RMSE: 4189.163939059238
R²: 0.8869614306043833

As before, let's record the metrics in the table.

In [ ]:
row = pd.DataFrame({
    "Model": ["CatBoostRegressor 3ver"],
    "MAE": [mae_cat],
    "RMSE": [rmse_cat],
    "R2": [r2_cat]
})

Result_model = pd.concat([Result_model, row])

Next, let's check how the model with the best parameters, which were found through a random search, behaves on a dataset with new features.

In [ ]:
cat_features = ["smoker", "sex", "region", "obesity_class", "age_decade", "children_bucket"]
feature_cols = [c for c in df.columns if c not in ["charges"]]


best_cat = CatBoostRegressor(
    verbose=False, random_state=42, task_type="GPU", devices="0", **best_params
)

X = df[feature_cols]
y = df["charges"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

cat.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test), verbose=100)

y_pred_cat = cat.predict(X_test)

mae_cat = mean_absolute_error(y_test, y_pred_cat)
rmse_cat = np.sqrt(mean_squared_error(y_test, y_pred_cat))
r2_cat = r2_score(y_test, y_pred_cat)

feature_importance = cat.get_feature_importance(prettified=True)
print(feature_importance)

print("CatBoost")
print("MAE:", mae_cat)
print("RMSE:", rmse_cat)
print("R²:", r2_cat)
0:	learn: 11969.9835938	test: 12238.5188345	best: 12238.5188345 (0)	total: 3.93ms	remaining: 7.86s
100:	learn: 6517.5611660	test: 6334.9234598	best: 6334.9234598 (100)	total: 184ms	remaining: 3.46s
200:	learn: 4968.6564731	test: 4634.0821533	best: 4634.0821533 (200)	total: 370ms	remaining: 3.31s
300:	learn: 4530.7969731	test: 4215.6273929	best: 4215.6273929 (300)	total: 575ms	remaining: 3.25s
400:	learn: 4378.8264679	test: 4131.8317375	best: 4131.8317375 (400)	total: 780ms	remaining: 3.11s
500:	learn: 4291.5397532	test: 4103.8390627	best: 4103.8390627 (500)	total: 971ms	remaining: 2.9s
600:	learn: 4219.7399430	test: 4096.2305784	best: 4095.6900510 (581)	total: 1.17s	remaining: 2.73s
700:	learn: 4155.1394118	test: 4092.7798366	best: 4091.9668157 (688)	total: 1.39s	remaining: 2.57s
800:	learn: 4095.8480642	test: 4095.3161950	best: 4091.9668157 (688)	total: 1.61s	remaining: 2.4s
900:	learn: 4042.9257970	test: 4099.4890827	best: 4091.9668157 (688)	total: 1.83s	remaining: 2.23s
1000:	learn: 3990.8418157	test: 4108.1147734	best: 4091.9668157 (688)	total: 2.04s	remaining: 2.03s
1100:	learn: 3944.5296308	test: 4111.2769369	best: 4091.9668157 (688)	total: 2.25s	remaining: 1.84s
1200:	learn: 3905.1795316	test: 4116.7957579	best: 4091.9668157 (688)	total: 2.48s	remaining: 1.65s
1300:	learn: 3861.9687555	test: 4122.1770834	best: 4091.9668157 (688)	total: 2.69s	remaining: 1.45s
1400:	learn: 3821.8052445	test: 4129.8965665	best: 4091.9668157 (688)	total: 2.9s	remaining: 1.24s
1500:	learn: 3775.5377192	test: 4137.3658247	best: 4091.9668157 (688)	total: 3.11s	remaining: 1.03s
1600:	learn: 3732.5373524	test: 4147.7384160	best: 4091.9668157 (688)	total: 3.33s	remaining: 831ms
1700:	learn: 3685.3535428	test: 4156.7425063	best: 4091.9668157 (688)	total: 3.57s	remaining: 627ms
1800:	learn: 3638.3932733	test: 4168.5845339	best: 4091.9668157 (688)	total: 3.77s	remaining: 417ms
1900:	learn: 3599.5367201	test: 4174.4762325	best: 4091.9668157 (688)	total: 3.99s	remaining: 208ms
1999:	learn: 3561.8668688	test: 4183.1028730	best: 4091.9668157 (688)	total: 4.2s	remaining: 0us

bestTest = 4091.966816
bestIteration = 688

Shrink model to first 689 iterations.
        Feature Id  Importances
0           smoker    76.880263
1              bmi    12.396217
2              age     4.684975
3       age_decade     3.173683
4         children     0.866415
5    obesity_class     0.768850
6  children_bucket     0.572277
7           region     0.475337
8              sex     0.181983
CatBoost
MAE: 2373.116962536972
RMSE: 4091.966861290222
R²: 0.8900345281700345

As usual, we record the results in a table.

In [ ]:
row = pd.DataFrame({
    "Model": ["CatBoostRegressor 4ver"],
    "MAE": [mae_cat],
    "RMSE": [rmse_cat],
    "R2": [r2_cat]
})

Result_model = pd.concat([Result_model, row])

Let's conduct an experiment: we'll find the outliers, delete them, and then check how the gradient boosting model behaves.

In this case, we are deleting data that can significantly distort the picture for the model and, of course, degrade the metrics.

We remove observations that fall outside the following bounds (computed for each numeric column):

the 0.05 quantile is the value below which 5% of the observations lie;

the 0.95 quantile is the value above which 5% of the observations lie.

In [ ]:
df = pd.read_csv("insurance.csv")

numeric_cols = df.select_dtypes(include=[np.number]).columns

low = df[numeric_cols].quantile(0.05)
high = df[numeric_cols].quantile(0.95)

df_no_outliers = df[~((df[numeric_cols] < low) | (df[numeric_cols] > high)).any(axis=1)]

We will display information about the dataset

In [ ]:
df_no_outliers.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1021 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1021 non-null   int64  
 1   sex       1021 non-null   object 
 2   bmi       1021 non-null   float64
 3   children  1021 non-null   int64  
 4   smoker    1021 non-null   object 
 5   region    1021 non-null   object 
 6   charges   1021 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 63.8+ KB

It can be seen that about 300 rows have been deleted, almost a quarter of all the data, which is generally not recommended; but let's see how the removal affects the model's final predictions.
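The exact share of removed observations can be computed directly (a one-line sketch):

In [ ]:
# Fraction of rows removed by the quantile filter
print(f"Removed {1 - len(df_no_outliers) / len(df):.1%} of the observations")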

Let's build the boxplots that were built in the paragraph above and estimate the distribution of the data.

In [ ]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.boxplot(y="region", x="charges", data=df_no_outliers, ax=axes[0])
axes[0].set_title("Spending by region")

sns.boxplot(y="sex", x="charges", data=df_no_outliers, ax=axes[1])
axes[1].set_title("Spending by gender")

sns.boxplot(y="smoker", x="charges", data=df_no_outliers, ax=axes[2])
axes[2].set_title("Smoking expenses")

plt.tight_layout()
plt.show()

It can be seen that there are slightly fewer outliers.

Next, we will train the gradient boosting model on the trimmed dataset and check the metrics.

In [ ]:
cat_features = ["smoker", "sex", "region"]
feature_cols = [c for c in df.columns if c not in ["charges"]]


cat = CatBoostRegressor(
    iterations=2700,
    depth=9,
    learning_rate=0.002,
    verbose=100, random_state=42
)

X = df_no_outliers[feature_cols]
y = df_no_outliers["charges"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

cat.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test), verbose=100)

y_pred_cat = cat.predict(X_test)

mae_cat = mean_absolute_error(y_test, y_pred_cat)
rmse_cat = np.sqrt(mean_squared_error(y_test, y_pred_cat))
r2_cat = r2_score(y_test, y_pred_cat)

feature_importance = cat.get_feature_importance(prettified=True)
print(feature_importance)

print("CatBoost")
print("MAE:", mae_cat)
print("RMSE:", rmse_cat)
print("R²:", r2_cat)
0:	learn: 9897.0413125	test: 8906.4267767	best: 8906.4267767 (0)	total: 851us	remaining: 2.3s
100:	learn: 8792.4839207	test: 7812.1211522	best: 7812.1211522 (100)	total: 104ms	remaining: 2.68s
200:	learn: 7891.9644000	test: 6905.9187320	best: 6905.9187320 (200)	total: 220ms	remaining: 2.74s
300:	learn: 7176.7820512	test: 6169.8327394	best: 6169.8327394 (300)	total: 340ms	remaining: 2.71s
400:	learn: 6611.0879826	test: 5576.5673997	best: 5576.5673997 (400)	total: 465ms	remaining: 2.67s
500:	learn: 6164.4200911	test: 5094.1977204	best: 5094.1977204 (500)	total: 614ms	remaining: 2.7s
600:	learn: 5815.7735646	test: 4705.3250723	best: 4705.3250723 (600)	total: 753ms	remaining: 2.63s
700:	learn: 5547.3455048	test: 4402.8985985	best: 4402.8985985 (700)	total: 917ms	remaining: 2.62s
800:	learn: 5339.6025349	test: 4159.2304067	best: 4159.2304067 (800)	total: 1.06s	remaining: 2.52s
900:	learn: 5169.1714018	test: 3959.3475285	best: 3959.3475285 (900)	total: 1.21s	remaining: 2.42s
1000:	learn: 5035.4947364	test: 3810.6290259	best: 3810.6290259 (1000)	total: 1.38s	remaining: 2.34s
1100:	learn: 4924.9501339	test: 3687.2531723	best: 3687.2531723 (1100)	total: 1.54s	remaining: 2.24s
1200:	learn: 4835.3711923	test: 3592.3664282	best: 3592.3664282 (1200)	total: 1.72s	remaining: 2.14s
1300:	learn: 4759.9560259	test: 3516.2057424	best: 3516.2057424 (1300)	total: 1.93s	remaining: 2.07s
1400:	learn: 4692.2595470	test: 3453.3606403	best: 3453.3606403 (1400)	total: 2.12s	remaining: 1.96s
1500:	learn: 4634.5055861	test: 3407.5697656	best: 3407.5697656 (1500)	total: 2.31s	remaining: 1.85s
1600:	learn: 4582.9786436	test: 3371.3128485	best: 3371.3128485 (1600)	total: 2.52s	remaining: 1.73s
1700:	learn: 4539.4351243	test: 3343.9115287	best: 3343.8764532 (1699)	total: 2.7s	remaining: 1.59s
1800:	learn: 4492.5670431	test: 3326.1500465	best: 3326.1500465 (1800)	total: 2.93s	remaining: 1.46s
1900:	learn: 4449.1709672	test: 3312.8196046	best: 3312.6574863 (1899)	total: 3.17s	remaining: 1.33s
2000:	learn: 4414.4671137	test: 3300.9165097	best: 3300.9165097 (2000)	total: 3.36s	remaining: 1.17s
2100:	learn: 4377.7072952	test: 3291.6828146	best: 3291.5949542 (2099)	total: 3.58s	remaining: 1.02s
2200:	learn: 4347.6894171	test: 3284.1062182	best: 3284.0668611 (2197)	total: 3.78s	remaining: 858ms
2300:	learn: 4318.7481154	test: 3280.6572629	best: 3280.2862776 (2286)	total: 4.02s	remaining: 697ms
2400:	learn: 4288.9738170	test: 3277.1930535	best: 3277.1316001 (2394)	total: 4.24s	remaining: 528ms
2500:	learn: 4256.6900365	test: 3274.7293469	best: 3274.0859575 (2482)	total: 4.47s	remaining: 356ms
2600:	learn: 4224.1697094	test: 3274.7086500	best: 3274.0859575 (2482)	total: 4.7s	remaining: 179ms
2699:	learn: 4197.7877597	test: 3274.8881425	best: 3274.0859575 (2482)	total: 4.92s	remaining: 0us

bestTest = 3274.085958
bestIteration = 2482

Shrink model to first 2483 iterations.
  Feature Id  Importances
0     smoker    71.994475
1        age    11.329495
2        bmi    11.115244
3   children     2.485361
4     region     2.124998
5        sex     0.950427
CatBoost
MAE: 2296.9732261320055
RMSE: 3274.0859712161337
R²: 0.8636492205195162

Let's add everything to the results table.

In [ ]:
row = pd.DataFrame({
    "Model": ["CatBoostRegressor 5ver"],
    "MAE": [mae_cat],
    "RMSE": [rmse_cat],
    "R2": [r2_cat]
})

Result_model = pd.concat([Result_model, row], ignore_index=True)

Let's evaluate the results

In [ ]:
Result_model
Out[0]:
Model MAE RMSE R2
0 Lasso 3787.632765 5729.913848 0.779741
1 Random Forest 2393.117023 3944.832629 0.895601
2 CatBoostRegressor 1ver 2465.218827 4265.201598 0.875928
3 CatBoostRegressor 2ver 2349.310737 4288.497882 0.874569
4 CatBoostRegressor 3ver 2282.952349 4189.163939 0.886961
5 CatBoostRegressor 4ver 2373.116963 4091.966861 0.890035
6 CatBoostRegressor 5ver 2296.973226 3274.085971 0.863649

The table above shows that the Random Forest model achieved the best R2, while the gradient boosting model (version 5) achieved the lowest RMSE and one of the lowest MAE values. However, keep in mind that the version-5 gradient boosting model was trained on transformed data from which potential outliers were removed (even though such clients may well occur in reality). The Random Forest model, in turn, reaches a higher R2 on the original dataset, but its RMSE and MAE are worse. This reflects the presence in the original data of observations with charges far above the average level: with a mean of about 13 thousand units, there are cases with charges of about 60 thousand units. Such extreme values inflate the absolute error metrics of Random Forest. In contrast, the gradient boosting model trained without outliers shows lower MAE and RMSE errors.

Thus, among all the trained models, we will consider the one with the lowest overall errors to be the best.

In our case, this is gradient boosting version 5.
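One way to make this choice explicit is to sort the results table by the error metrics (a sketch; lower MAE/RMSE is better, higher R2 is better):

In [ ]:
# Rank the models by their error metrics
print(Result_model.sort_values(by=["RMSE", "MAE"]))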

What factors influence the target the most?

Let's evaluate feature importance in two ways:

  1. Importance based on EDA
  2. Importance obtained with the help of models

Importance based on EDA

Let's consider each feature separately.

Smoking (smoker)

The most significant factor. The correlation coefficient between smoking and expenses is 0.66, which indicates a strong relationship.

The graphs show that smokers have significantly higher expenses than non-smokers.

The scatter plot also shows that even with the same age and BMI, smokers have significantly higher costs.
Smoking dramatically increases health risks (heart disease, lung disease, cancer), which leads to higher costs.

Age (age)

Age comes in second place in terms of influence (correlation 0.53).

The graphs show that the older a person is, the higher the costs.

This is especially noticeable for smokers: as they age, their expenses increase, looking a bit like an exponential dependence.
The reason is that with age, the likelihood of chronic diseases and seeking medical help increases.

The presence of children

The correlation is weak (0.13), but still positive.

Families with 2-3 children have higher expenses on average than those without children or with one child.

However, with 5 children, average expenses decrease; this may be due to the characteristics of the sample.
The overall reason is that having children increases healthcare costs.

Body Mass Index (BMI)

The correlation with expenses is weak (0.11), but in combination with other factors, BMI plays a role.

The graphs show that obese people have higher expenses, especially if they smoke.

At the same time, obesity without smoking does not always lead to a sharp increase in costs.
The reason is that obesity is associated with the risks of diabetes and cardiovascular diseases, but these effects are enhanced by other risk factors.

Gender and region (sex, region)

There is no noticeable effect by gender or region.

Correlations are close to zero.

The distribution of expenses is almost the same for men and women, as well as for different regions.
The likely reason is that, in this sample, insurance rates and medical costs do not depend noticeably on the region of residence or gender.

Importance obtained with the help of models

We see a similar picture of importance if we look at which features carry more and which carry less weight after the models are trained.

For example, the feature importance coefficients of the Random Forest model after training:

  1. smoker_yes    0.700647
  2. bmi           0.175183
  3. age           0.115571

As can be seen, the EDA also identified these same features as highly important.

Similarly, for the gradient boosting models the same features turned out to be the most important, just with different coefficients:

  1. smoker
  2. bmi
  3. age

Thus, we can confidently conclude that the most significant features are the smoking factor, the bmi index, and age.

Conclusion

In the example, an exploratory data analysis was performed, regression models were trained, and the most significant features affecting the target variable were identified.